chi-squared distribution


Privacy Amplification Persists under Unlimited Synthetic Data Release

Pierquin, Clément, Bellet, Aurélien, Tommasi, Marc, Boussard, Matthieu

arXiv.org Machine Learning

We study privacy amplification by synthetic data release, a phenomenon in which differential privacy guarantees are improved by releasing only synthetic data rather than the private generative model itself. Recent work by Pierquin et al. (2025) established the first formal amplification guarantees for a linear generator, but they apply only in asymptotic regimes where the model dimension far exceeds the number of released synthetic records, limiting their practical relevance. In this work, we show a surprising result: under a bounded-parameter assumption, privacy amplification persists even when releasing an unbounded number of synthetic records, thereby improving upon the bounds of Pierquin et al. (2025). Our analysis provides structural insights that may guide the development of tighter privacy guarantees for more complex release mechanisms.


Bayesian Optimization of Robustness Measures Using Randomized GP-UCB-based Algorithms under Input Uncertainty

Inatsu, Yu

arXiv.org Machine Learning

Bayesian optimization based on Gaussian process upper confidence bound (GP-UCB) has a theoretical guarantee for optimizing black-box functions. Black-box functions often have input uncertainty, but even in this case, GP-UCB can be extended to optimize evaluation measures called robustness measures. However, GP-UCB-based methods for robustness measures include a trade-off parameter $\beta$, which must be excessively large to achieve theoretical validity, just like the original GP-UCB. In this study, we propose a new method called randomized robustness measure GP-UCB (RRGP-UCB), which samples the trade-off parameter $\beta$ from a probability distribution based on a chi-squared distribution and avoids explicitly specifying $\beta$. The expected value of $\beta$ is not excessively large. Furthermore, we show that RRGP-UCB provides tight bounds on the expected value of regret based on the optimal solution and estimated solutions. Finally, we demonstrate the usefulness of the proposed method through numerical experiments.
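To make the "excessively large" trade-off parameter concrete, the sketch below compares a deterministic schedule with the average of chi-squared draws. It assumes the finite-candidate-set schedule $\beta_t = 2\log(|X| t^2 \pi^2 / (6\delta))$ usually associated with the original GP-UCB analysis. The abstract does not state which schedule it has in mind or which chi-squared-based distribution RRGP-UCB samples from, so the function names, the two degrees of freedom, and the numbers used here are illustrative assumptions only.

```python
import math
import random

def gp_ucb_beta(t, num_candidates, delta):
    """Assumed deterministic schedule: 2 * log(|X| * t^2 * pi^2 / (6 * delta))."""
    return 2.0 * math.log(num_candidates * t**2 * math.pi**2 / (6.0 * delta))

def sample_chi2_beta(rng, dof=2):
    """Draw beta from a chi-squared distribution with `dof` degrees of freedom.

    A chi-squared variable with k degrees of freedom is Gamma(shape=k/2, scale=2).
    """
    return rng.gammavariate(dof / 2.0, 2.0)

rng = random.Random(0)
draws = [sample_chi2_beta(rng) for _ in range(20000)]
mean_random_beta = sum(draws) / len(draws)  # close to dof, i.e. about 2

# Hypothetical setting: 1000 candidate points, failure probability 0.1.
schedule = [gp_ucb_beta(t, num_candidates=1000, delta=0.1) for t in (1, 10, 100)]
```

Under these assumptions the deterministic values keep growing with the iteration count, while the average of the sampled values stays near the number of degrees of freedom.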


Active Learning for Level Set Estimation Using Randomized Straddle Algorithms

Inatsu, Yu, Takeno, Shion, Kutsukake, Kentaro, Takeuchi, Ichiro

arXiv.org Machine Learning

Level set estimation (LSE), the problem of identifying the set of input points where a function takes values above (or below) a given threshold, is important in practical applications. For functions that are black-box and expensive to evaluate, the straddle algorithm, a representative heuristic for LSE based on Gaussian process models, has been developed, along with extensions that have theoretical guarantees. However, many existing methods include a confidence parameter $\beta^{1/2}_t$ that must be specified by the user, and methods that choose $\beta^{1/2}_t$ heuristically do not provide theoretical guarantees. In contrast, theoretically guaranteed values of $\beta^{1/2}_t$ must grow with the number of iterations and candidate points, which makes them conservative and hurts practical performance. In this study, we propose a novel method, the randomized straddle algorithm, in which $\beta_t$ in the straddle algorithm is replaced by a random sample from the chi-squared distribution with two degrees of freedom. The confidence parameter in the proposed method has the advantages of not needing adjustment, not depending on the number of iterations and candidate points, and not being conservative. Furthermore, we show that the proposed method has theoretical guarantees that depend on the sample complexity and the number of iterations. Finally, we confirm the usefulness of the proposed method through numerical experiments using synthetic and real data.
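The sampling step described in this abstract can be sketched in a few lines. The example assumes the classic straddle score $\beta_t^{1/2}\sigma(x) - |\mu(x) - \theta|$, where $\mu$ and $\sigma$ are the Gaussian process posterior mean and standard deviation and $\theta$ is the threshold; the paper's exact acquisition function may differ. The helper names and the posterior values are hypothetical.

```python
import math
import random

def sample_beta(rng):
    """Sample from the chi-squared distribution with two degrees of freedom.

    That distribution coincides with an exponential distribution of mean 2,
    i.e. rate 1/2, so the standard library sampler is sufficient.
    """
    return rng.expovariate(0.5)

def straddle_scores(mu, sigma, threshold, beta):
    """Assumed straddle score: sqrt(beta) * sigma(x) - |mu(x) - threshold|."""
    root = math.sqrt(beta)
    return [root * s - abs(m - threshold) for m, s in zip(mu, sigma)]

def select_next(mu, sigma, threshold, rng):
    """Draw a fresh beta and return (index of best candidate, beta)."""
    beta = sample_beta(rng)
    scores = straddle_scores(mu, sigma, threshold, beta)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, beta

# Hypothetical posterior over three candidate points.
mu = [0.0, 1.0, 2.5]
sigma = [1.0, 0.5, 0.1]
index, beta = select_next(mu, sigma, threshold=0.5, rng=random.Random(0))
```

A new value of beta is drawn at every iteration, so no confidence parameter has to be tuned by the user.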


Towards Exact Computation of Inductive Bias

Boopathy, Akhilan, Yue, William, Hwang, Jaedong, Iyer, Abhiram, Fiete, Ila

arXiv.org Machine Learning

Much research in machine learning involves finding appropriate inductive biases (e.g. convolutional neural networks, momentum-based optimizers, transformers) to promote generalization on tasks. However, quantification of the amount of inductive bias associated with these architectures and hyperparameters has been limited. We propose a novel method for efficiently computing the inductive bias required for generalization on a task with a fixed training data budget; formally, this corresponds to the amount of information required to specify well-generalizing models within a specific hypothesis space of models. Our approach involves modeling the loss distribution of random hypotheses drawn from a hypothesis space to estimate the required inductive bias for a task relative to these hypotheses. Unlike prior work, our method provides a direct estimate of inductive bias without using bounds and is applicable to diverse hypothesis spaces. Moreover, we derive approximation error bounds for our estimation approach in terms of the number of sampled hypotheses. Consistent with prior results, our empirical results demonstrate that higher dimensional tasks require greater inductive bias. We show that relative to other expressive model classes, neural networks as a model class encode large amounts of inductive bias. Furthermore, our measure quantifies the relative difference in inductive bias between different neural network architectures. Our proposed inductive bias metric provides an information-theoretic interpretation of the benefits of specific model architectures for certain tasks and provides a quantitative guide to developing tasks requiring greater inductive bias, thereby encouraging the development of more powerful inductive biases.
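The idea of measuring inductive bias as the information needed to single out well-generalizing hypotheses can be illustrated with a naive Monte Carlo estimate: sample hypotheses at random, count the fraction whose loss falls below a target, and take the negative log of that fraction. This is only a sketch under stated assumptions. The abstract says the authors model the loss distribution rather than count samples directly, and the function name, loss values, and threshold below are hypothetical.

```python
import math

def required_bits(losses, loss_threshold):
    """Naive estimate of -log2 P(loss <= threshold) over sampled hypotheses."""
    good = sum(1 for loss in losses if loss <= loss_threshold)
    if good == 0:
        # No sampled hypothesis generalizes well enough; counting cannot
        # estimate the probability, which is why a fitted model of the
        # loss distribution is needed in practice.
        raise ValueError("no sampled hypothesis meets the threshold")
    return -math.log2(good / len(losses))

# Hypothetical test losses of four randomly drawn hypotheses.
bits = required_bits([0.1, 0.5, 0.9, 0.2], loss_threshold=0.25)
```

When half of the sampled hypotheses meet the threshold, one bit is enough to pick out the good half; rarer success corresponds to more bits.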


Bounding Reconstruction Attack Success of Adversaries Without Data Priors

Ziller, Alexander, Riess, Anneliese, Schwethelm, Kristian, Mueller, Tamara T., Rueckert, Daniel, Kaissis, Georgios

arXiv.org Artificial Intelligence

Reconstruction attacks on machine learning (ML) models pose a strong risk of leakage of sensitive data. In specific contexts, an adversary can (almost) perfectly reconstruct training data samples from a trained model using the model's gradients. When training ML models with differential privacy (DP), formal upper bounds on the success of such reconstruction attacks can be provided. So far, these bounds have been formulated under worst-case assumptions that may not hold in realistic settings. In this work, we provide formal upper bounds on reconstruction success under realistic adversarial settings against ML models trained with DP and support these bounds with empirical results. With this, we show that in realistic scenarios, (a) the expected reconstruction success can be bounded appropriately in different contexts and by different metrics, which (b) allows for a more educated choice of a privacy parameter.


Statistics (III) ANOVA in Data Science & Machine Learning

#artificialintelligence

For the last part of the Statistics series, we will cover ANOVA, post-hoc pairwise comparison, two-way ANOVA, and R-squared. Previously, our study focused on one or two groups of subjects. How can we handle multiple groups with multiple factors? For example, the dose level and gender may both affect the effectiveness of a vaccine. How can we determine whether the effect is statistically significant for particular combinations?
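As a minimal sketch of the one-way case, the F statistic is the ratio of the between-group mean square to the within-group mean square. The function name and the group values below are hypothetical, and the example stops at the statistic: turning it into a p-value requires the F distribution, which is not computed here.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    group_means = [fmean(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical responses for three dose levels.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

A large F value means the group means differ by more than the within-group scatter would suggest.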


Characterizing Deep Gaussian Processes via Nonlinear Recurrence Systems

Tong, Anh, Choi, Jaesik

arXiv.org Machine Learning

Recent advances in Deep Gaussian Processes (DGPs) show the potential for more expressive representations than those of traditional Gaussian Processes (GPs). However, DGPs suffer from a pathology: their learning capacity drops significantly as the number of layers increases. In this paper, we present a new analysis of DGPs that studies their corresponding nonlinear dynamical systems to explain the issue. Existing work reports the pathology for the squared exponential kernel function. We extend the investigation to four types of common stationary kernel functions. The recurrence relations between layers are analytically derived, providing a tighter bound and the rate of convergence of the dynamical systems. We demonstrate our findings with a number of experimental results.


Famous Probability Distributions in Data Science

#artificialintelligence

Data scientists are modern-day statisticians who take on complex business problems and unravel them with the help of data. Probability distributions allow a data scientist or data analyst to recognize patterns in otherwise completely random variables. A normal distribution is generally described as a bell-shaped curve, and it depicts the frequency of whatever you are measuring, such as class scores. The center of the curve is the mean, and the width of the curve is set by the standard deviation. The score that occurs most frequently is the mean.
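The bell-curve description can be checked with Python's standard library. The sketch below uses a hypothetical distribution of class scores with mean 70 and standard deviation 10; the numbers are illustrative and do not come from the article.

```python
from statistics import NormalDist

# Hypothetical class scores: mean 70, standard deviation 10.
scores = NormalDist(mu=70.0, sigma=10.0)

# The density is highest at the mean and falls off symmetrically.
density_at_mean = scores.pdf(70.0)
density_one_sd_below = scores.pdf(60.0)
density_one_sd_above = scores.pdf(80.0)

# Share of scores within one standard deviation of the mean (about 68%).
within_one_sd = scores.cdf(80.0) - scores.cdf(60.0)
```

A larger standard deviation flattens and widens the curve, while the peak stays at the mean.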


Model Specification Test with Unlabeled Data: Approach from Covariate Shift

Kato, Masahiro, Kawarazaki, Hikaru

arXiv.org Machine Learning

We propose a novel framework for model specification testing in regression using unlabeled test data. Statistical inference is often conducted under the assumption that the model is correctly specified. However, it is difficult to confirm whether a model is correctly specified. To overcome this problem, existing works have devised statistical tests for model specification. Existing works have defined a correctly specified model in regression as a model with zero conditional mean of the error term over the training data only. Extending the definition in conventional statistical tests, we define a correctly specified model as a model with zero conditional mean of the error term over any distribution of the explanatory variable. This definition is a natural consequence of the orthogonality of the explanatory variable and the error term. If a model does not satisfy this condition, the model might lack robustness with regard to distribution shift. The proposed method would enable us to reject a misspecified model under our definition. By applying the proposed method, we can obtain a model that predicts the label for the unlabeled test data well without losing the interpretability of the model. In experiments, we show how the proposed method works for synthetic and real-world datasets.


Common Probability Distributions – Sean Owen – Medium

#artificialintelligence

Data scientists have hundreds of probability distributions from which to choose. Data science, whatever it may be, remains a big deal. "A data scientist is better at statistics than any software engineer," you may overhear a pundit say, at your local tech get-togethers and hackathons. The applied mathematicians have their revenge, because statistics hasn't been this talked-about since the roaring 20s. They have their own legitimizing Venn diagram of which people don't make fun. Suddenly it's you, the engineer, left out of the chat about confidence intervals instead of tutting at the analysts who have never heard of the Apache Bikeshed project for distributed comment formatting.